Data processing and feature engineering

data<- read.csv('palm_ffb.csv')

data$Date <- as.Date(data$Date,format = "%d.%m.%Y")
data$temperature_range <- data$Max_Temp-data$Min_Temp
data$month <- as.numeric(format(data$Date,"%m"))
data$year <- as.numeric(format(data$Date,"%Y"))
head(data)
##         Date SoilMoisture Average_Temp Min_Temp Max_Temp Precipitation
## 1 2008-01-01        616.4     25.30645     21.3     32.2         184.4
## 2 2008-02-01        568.9     26.16552     20.9     35.1         140.2
## 3 2008-03-01        577.6     25.44839     21.3     32.9         280.4
## 4 2008-04-01        581.1     26.90333     20.6     34.8         173.3
## 5 2008-05-01        545.4     27.24194     20.9     35.0         140.6
## 6 2008-06-01        532.5     27.11667     21.4     35.5         182.3
##   Working_days HA_Harvested FFB_Yield temperature_range month year
## 1           25     777778.4      1.62              10.9     1 2008
## 2           23     767988.3      1.45              14.2     2 2008
## 3           25     783951.9      1.56              11.6     3 2008
## 4           25     788987.1      1.39              14.2     4 2008
## 5           25     813659.7      1.44              14.1     5 2008
## 6           24     829817.6      1.48              14.1     6 2008

I realized that the temperature fluctuation might affect the tree yield so i created a new feature called temperature range. My hypothesis is that if the temperature range is huge, the yield might be low.

There is an issue with the precipitation data. The total precipitation in february 2014 is 2 mm which is quite impossible. So i imputed the value with the mean of the precipitation for all february data points

data$Precipitation[data$Precipitation<5] <- mean(data$Precipitation[data$month==2])
price<- read.csv('oil_palm.csv')
data_new <- cbind(data,price)
data_new$Price <- as.character(data_new$Price)
data_new$Price <- gsub(",","",data_new$Price)
data_new$Price <- as.numeric(data_new$Price)
head(data_new)
##         Date SoilMoisture Average_Temp Min_Temp Max_Temp Precipitation
## 1 2008-01-01        616.4     25.30645     21.3     32.2         184.4
## 2 2008-02-01        568.9     26.16552     20.9     35.1         140.2
## 3 2008-03-01        577.6     25.44839     21.3     32.9         280.4
## 4 2008-04-01        581.1     26.90333     20.6     34.8         173.3
## 5 2008-05-01        545.4     27.24194     20.9     35.0         140.6
## 6 2008-06-01        532.5     27.11667     21.4     35.5         182.3
##   Working_days HA_Harvested FFB_Yield temperature_range month year  Price
## 1           25     777778.4      1.62              10.9     1 2008 1059.0
## 2           23     767988.3      1.45              14.2     2 2008 1160.0
## 3           25     783951.9      1.56              11.6     3 2008 1249.0
## 4           25     788987.1      1.39              14.2     4 2008 1174.0
## 5           25     813659.7      1.44              14.1     5 2008 1207.5
## 6           24     829817.6      1.48              14.1     6 2008 1213.0

The price of oil palm might also be another external factor that might be affecting yield indirectly in the long term. This can be due to acquisition of more land, incentives etc.

I obtained the price of CPM from https://www.indexmundi.com/commodities/?commodity=palm-oil&months=180

Next, Below is the correlation plot of all the variables.

Correlation Plot

library(corrplot,verbose = FALSE)
## corrplot 0.84 loaded
corr <- cor(data_new[,2:ncol(data_new)])
corrplot(corr, method="number")

Adding month and year as a feature helps with the correlation. month is very positively correlated with the FFB Yield(0.63). Month is also very positively correlated to precipitation(0.44) which correlates highly with FFB Yield.

Temperature range has improved the correlation with precipitation compared to max temperature. The FFB Yield have a positive correlation with precpitation(0.29) and slightly negative correlation with price(-0.21)

Next, im plotting all the variables against the FFB Yield to see if there are any significant patterns not captured by the correlation plot. (I’m keeping in mind that this ignores the factor of time and but just to serve my curiosity, i will plot scatterplots)

Scatterplots

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
scatter_fun = function(x, y) {
     ggplotly(ggplot(data_new, aes(x = .data[[x]], y = .data[[y]]) ) +
          geom_point() +
          geom_smooth(method = "loess", se = FALSE, color = "grey74") +
          theme_bw() +
          labs(x = x,
               y = y))
}

scatter_fun("SoilMoisture","FFB_Yield")
scatter_fun("Average_Temp","FFB_Yield")
scatter_fun("Min_Temp","FFB_Yield")
scatter_fun("Max_Temp","FFB_Yield")
scatter_fun("temperature_range","FFB_Yield")
scatter_fun("Precipitation","FFB_Yield")
scatter_fun("Working_days","FFB_Yield")
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 25
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
scatter_fun("HA_Harvested","FFB_Yield")
scatter_fun("year","FFB_Yield")
scatter_fun("Price","FFB_Yield")
scatter_fun("month","FFB_Yield")

From all the scatterplots, month has a very interesting pattern with the FFB Yield. The FFB Yield seem to increase from february to october then in plunges unitil january. If this data is from Malaysia, It could be because of the monsoon season that affects us towards the end of the year.

Time series analysis

Seasonality study

library('forecast')
library('tseries')
count_ma = ts(na.omit(data_new$FFB_Yield), start=c(2008,1),end=c(2018,1),frequency=12) 
Acf(count_ma, main='')

ACF plot is used to check for seasonality. It shows how well the present value of the series is related with its past values/lag values. The blue horizontal lines are the confidence intervals (95%) Since there are alot of lines outside the 95% confidence interval, there is not enough evidence to claim that there is no seasonality.

decomp <- function(col){
  
  count_ma = ts(na.omit(col), start=c(2008,1),end=c(2018,1),frequency=12) 
  decomp = stl(count_ma, s.window="periodic")
  return(plot(decomp))
  
}

decomp(data$FFB_Yield)

Next i tried to analyse the yield by decomposing it into the seasonal component, the trending component and the white noise component.

Notice that in the seasonal section, there is a cycle every 12 months. It will be interesting to see the plot comparison on a yearly basis.

The Trend chart is mostly constant from 2008 to 2016. Then it picks up in 2017 to 2018. This could be associated with maturity of the oil palms or improvement in process etc.

Next im going to extract the seasonal component from all the variables to see if it affects the FFB Yield(actual component)

seasonal <- as.data.frame(data_new$FFB_Yield[1:121])

count_ma = ts(na.omit(data_new$SoilMoisture), start=c(2008,1),end=c(2018,1),frequency=12)
decomp = stl(count_ma, s.window="periodic")
decompi = as.data.frame(decomp$time.series)
seasonal$SoilMoisture <- decompi$seasonal
seasonal <-as.data.frame(seasonal)

# Precipitation
count_ma = ts(na.omit(data_new$Precipitation), start=c(2008,1),end=c(2018,1),frequency=12) 
decomp = stl(count_ma, s.window="periodic")
decompi = as.data.frame(decomp$time.series)
seasonal$Precipitation <- decompi$seasonal

# HA_Harvested
count_ma = ts(na.omit(data_new$HA_Harvested), start=c(2008,1),end=c(2018,1),frequency=12) 
decomp = stl(count_ma, s.window="periodic")
decompi = as.data.frame(decomp$time.series)
seasonal$HA_Harvested <- decompi$seasonal

# Working_days
count_ma = ts(na.omit(data_new$Working_days), start=c(2008,1),end=c(2018,1),frequency=12) 
decomp = stl(count_ma, s.window="periodic")
decompi = as.data.frame(decomp$time.series)
seasonal$Working_days <- decompi$seasonal

# FFB_Yield
count_ma = ts(na.omit(data_new$FFB_Yield), start=c(2008,1),end=c(2018,1),frequency=12) 
decomp = stl(count_ma, s.window="periodic")
decompi = as.data.frame(decomp$time.series)
seasonal$FFB_Yield <- decompi$seasonal

# Price
count_ma = ts(na.omit(data_new$Price), start=c(2008,1),end=c(2018,1),frequency=12) 
decomp = stl(count_ma, s.window="periodic")
decompi = as.data.frame(decomp$time.series)
seasonal$Price <- decompi$seasonal

# Min_Temp
count_ma = ts(na.omit(data_new$Min_Temp), start=c(2008,1),end=c(2018,1),frequency=12) 
decomp = stl(count_ma, s.window="periodic")
decompi = as.data.frame(decomp$time.series)
seasonal$Min_Temp <- decompi$seasonal

# Max_Temp
count_ma = ts(na.omit(data_new$Max_Temp), start=c(2008,1),end=c(2018,1),frequency=12) 
decomp = stl(count_ma, s.window="periodic")
decompi = as.data.frame(decomp$time.series)
seasonal$Max_Temp <- decompi$seasonal

# temperature_range
count_ma = ts(na.omit(data_new$temperature_range), start=c(2008,1),end=c(2018,1),frequency=12) 
decomp = stl(count_ma, s.window="periodic")
decompi = as.data.frame(decomp$time.series)
seasonal$temperature_range <- decompi$seasonal
library(dygraphs)
# Soil Moisture

data_time <- data_new[c("SoilMoisture","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "SoilMoisture") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
data_time <- seasonal[c("SoilMoisture","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "SoilMoisture") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()

The first chart shows the correlation of Soil Moisture with the noise and trending whereas the second chart shows the correlation of Soil Moisture without the noise and trending.

The charts above shows the correlation when the original Soil Moisture is used versus when only the seasonal component of Soil Moisture is used. The distance between the peaks and troughs shows how correlated the data is.

Below, I repeated the similar process for all the remaining variables

# Precipitation
data_time <- data_new[c("Precipitation","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Precipitation") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
data_time <- seasonal[c("Precipitation","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Precipitation") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
# HA_Harvested
data_time <- data_new[c("HA_Harvested","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "HA_Harvested") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
data_time <- seasonal[c("HA_Harvested","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "HA_Harvested") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
# Working_days
data_time <- data_new[c("Working_days","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Working_days") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
data_time <- seasonal[c("Working_days","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Working_days") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
# Price
data_time <- data_new[c("Price","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Price") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
data_time <- seasonal[c("Price","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Price") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
# Min_Temp
data_time <- data_new[c("Min_Temp","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Min_Temp") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
data_time <- seasonal[c("Min_Temp","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Min_Temp") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
# Max_Temp
data_time <- data_new[c("Max_Temp","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Max_Temp") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
data_time <- seasonal[c("Max_Temp","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "Max_Temp") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
# temperature_range
data_time <- data_new[c("temperature_range","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "temperature_range") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()
data_time <- seasonal[c("temperature_range","FFB_Yield")]
timeseries <- ts(data_time,start=c(2008,1),end=c(2018,1),frequency=12)
dygraph(timeseries) %>%
  dyAxis("y", label = "temperature_range") %>%
  dyAxis("y2", label = "FFB_Yield", independentTicks = TRUE) %>%
  dySeries("FFB_Yield", axis = 'y2') %>% 
  dyRangeSelector()

The closer the peak and troughs are the more correlated the external factors are to the FFB Yield. Based on all the plots above, it is visually clear that Precipitation and Soil Moisture has the highest correlation to the FFB Yield. This is followed by hectar harvested. Working Days have the worst correlation with FFB Yield. All the temperature variables (min,max,range) have little correlation with the FFB_Yield. The overall yearly trend of the FFB_Yield could be also associated with the monsoon season towards the end of the year.